Text Categorization with Class-Based and Corpus-Based Keyword Selection
Abstract
In this paper, we examine the use of keywords in text categorization with SVM. Contrary to the common belief, we show that using keywords instead of all words yields better performance in terms of both accuracy and time. Unlike previous studies, which focus on keyword selection metrics, we compare two approaches to keyword selection. In the corpus-based approach, a single set of keywords is selected for all classes. In the class-based approach, a distinct set of keywords is selected for each class. We perform the experiments on the standard Reuters-21578 dataset, with both boolean and tf-idf weighting. Our results show that although tf-idf weighting performs better, boolean weighting can be used where time and space resources are limited. The corpus-based approach with 2000 keywords performs best. However, for a small number of keywords, the class-based approach outperforms the corpus-based approach with the same number of keywords.
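The difference between the two selection approaches can be illustrated with a minimal sketch. The abstract does not name the selection metric, so ranking terms by summed tf-idf weight is an assumption made here purely for illustration; the function names and the toy documents are hypothetical as well, and the SVM training step is left out.

```python
import math
from collections import Counter, defaultdict


def top_k(totals, k):
    # Rank by descending score; break ties alphabetically so output is stable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def tfidf_weights(docs):
    """Return one {term: tf * log(N / df)} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def corpus_based_keywords(docs, k):
    """Corpus-based: a single keyword set shared by all classes.

    Assumption: terms are scored by tf-idf summed over the whole corpus.
    """
    totals = defaultdict(float)
    for weights in tfidf_weights(docs):
        for term, w in weights.items():
            totals[term] += w
    return top_k(totals, k)


def class_based_keywords(docs, labels, k):
    """Class-based: a distinct keyword set for each class.

    Assumption: terms are scored by tf-idf summed over that class's documents.
    """
    per_class = defaultdict(lambda: defaultdict(float))
    for weights, label in zip(tfidf_weights(docs), labels):
        for term, w in weights.items():
            per_class[label][term] += w
    return {label: top_k(totals, k) for label, totals in per_class.items()}


def boolean_vector(doc, keywords):
    """Boolean weighting: 1 if the keyword occurs in the document, else 0."""
    present = set(doc)
    return [1 if kw in present else 0 for kw in keywords]


# Hypothetical toy corpus with two classes.
docs = [
    ["wheat", "wheat", "grain", "export"],
    ["wheat", "harvest", "grain"],
    ["oil", "oil", "crude", "export"],
    ["oil", "barrel", "crude"],
]
labels = ["agri", "agri", "energy", "energy"]

shared = corpus_based_keywords(docs, 2)
per_class = class_based_keywords(docs, labels, 1)
```

With the corpus-based set, every document is represented over the same keywords; with the class-based sets, each class gets its own representation, which is one plausible reason it helps when the keyword budget is small.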
Similar resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
Comparison of text feature selection policies and using an adaptive framework
Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categoriz...
A Technique for Proper Feature Selection with Automated Text Categorization in the Vector Space Model
Efficient and effective text categorization and information retrieval techniques are very important and play a major role in managing the ever increasing amount of data and textual information available in digital form. Text categorization has important applications like information retrieval, bad information identification, document and web resource filtering. Before the application of various...
Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords
In this paper, we examine the performance of the two policies for keyword selection over standard document corpora of varying properties. While in corpus-based policy a single set of keywords is selected for all classes globally, in class-based policy a distinct set of keywords is selected for each class locally. We use SVM as the learning method and perform experiments with boolean and tf-idf ...
A feature selection approach based on term distributions
Feature selection has a direct impact on text categorization. Most existing algorithms are based on document level, and they haven't considered the influence of term frequency on text categorization. Based on these, we put forward a feature selection approach, FSATD, based on term distributions in the paper. In our proposed algorithm, three critical factors which are term frequency, the inter-c...